The Prosper Loans dataset is a collection of data about loans provided between September 2005 and March 2014 to US customers by the peer-to-peer lending company Prosper, based in San Francisco, California. Prosper matches borrowers and lenders: lenders can decide to offer a loan based on borrowers’ credit scores, ratings, and the category of the loan (source: https://en.wikipedia.org/wiki/Prosper_Marketplace).
In this exploratory analysis I’ll try to find out who are the typical Prosper Loans customers, for what reasons they ask for a loan, if they are mostly able to repay the loan or if they don’t.
Let’s import the csv and see how the dataframe is structured
## [1] 113937 81
## 'data.frame': 113937 obs. of 81 variables:
## $ ListingKey : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
## $ ListingNumber : int 193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
## $ ListingCreationDate : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
## $ CreditGrade : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
## $ ClosedDate : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ LenderYield : num 0.138 0.082 0.24 0.0874 0.1985 ...
## $ EstimatedEffectiveYield : num NA 0.0796 NA 0.0849 0.1832 ...
## $ EstimatedLoss : num NA 0.0249 NA 0.0249 0.0925 ...
## $ EstimatedReturn : num NA 0.0547 NA 0.06 0.0907 ...
## $ ProsperRating..numeric. : int NA 6 NA 6 3 5 2 4 7 7 ...
## $ ProsperRating..Alpha. : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ ListingCategory..numeric. : int 0 2 0 16 2 1 1 2 7 7 ...
## $ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
## $ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
## $ EmploymentStatusDuration : int 2 44 NA 113 44 82 172 103 269 269 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
## $ CurrentlyInGroup : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
## $ GroupKey : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
## $ DateCreditPulled : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ FirstRecordedCreditLine : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
## $ CurrentCreditLines : int 5 14 NA 5 19 21 10 6 17 17 ...
## $ OpenCreditLines : int 4 14 NA 5 19 17 7 6 16 16 ...
## $ TotalCreditLinespast7years : int 12 29 3 29 49 49 20 10 32 32 ...
## $ OpenRevolvingAccounts : int 1 13 0 7 6 13 6 5 12 12 ...
## $ OpenRevolvingMonthlyPayment : num 24 389 0 115 220 1410 214 101 219 219 ...
## $ InquiriesLast6Months : int 3 3 0 0 1 0 0 3 1 1 ...
## $ TotalInquiries : num 3 5 1 1 9 2 0 16 6 6 ...
## $ CurrentDelinquencies : int 2 0 1 4 0 0 0 0 0 0 ...
## $ AmountDelinquent : num 472 0 NA 10056 0 ...
## $ DelinquenciesLast7Years : int 4 0 0 14 0 0 0 0 0 0 ...
## $ PublicRecordsLast10Years : int 0 1 0 0 0 0 0 1 0 0 ...
## $ PublicRecordsLast12Months : int 0 0 NA 0 0 0 0 0 0 0 ...
## $ RevolvingCreditBalance : num 0 3989 NA 1444 6193 ...
## $ BankcardUtilization : num 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
## $ AvailableBankcardCredit : num 1500 10266 NA 30754 695 ...
## $ TotalTrades : num 11 29 NA 26 39 47 16 10 29 29 ...
## $ TradesNeverDelinquent..percentage. : num 0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
## $ TradesOpenedLast6Months : num 0 2 NA 0 2 0 0 0 1 1 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
## $ IncomeVerifiable : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ LoanKey : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
## $ TotalProsperLoans : int NA NA NA NA 1 NA NA NA NA NA ...
## $ TotalProsperPaymentsBilled : int NA NA NA NA 11 NA NA NA NA NA ...
## $ OnTimeProsperPayments : int NA NA NA NA 11 NA NA NA NA NA ...
## $ ProsperPaymentsLessThanOneMonthLate: int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPaymentsOneMonthPlusLate : int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPrincipalBorrowed : num NA NA NA NA 11000 NA NA NA NA NA ...
## $ ProsperPrincipalOutstanding : num NA NA NA NA 9948 ...
## $ ScorexChangeAtTimeOfListing : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanCurrentDaysDelinquent : int 0 0 0 0 0 0 0 0 0 0 ...
## $ LoanFirstDefaultedCycleNumber : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanMonthsSinceOrigination : int 78 0 86 16 6 3 11 10 3 3 ...
## $ LoanNumber : int 19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationDate : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
## $ LoanOriginationQuarter : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
## $ MemberKey : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
## $ MonthlyLoanPayment : num 330 319 123 321 564 ...
## $ LP_CustomerPayments : num 11396 0 4187 5143 2820 ...
## $ LP_CustomerPrincipalPayments : num 9425 0 3001 4091 1563 ...
## $ LP_InterestandFees : num 1971 0 1186 1052 1257 ...
## $ LP_ServiceFees : num -133.2 0 -24.2 -108 -60.3 ...
## $ LP_CollectionFees : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_GrossPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NetPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NonPrincipalRecoverypayments : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PercentFunded : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Recommendations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsCount : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsAmount : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Investors : int 258 1 41 158 20 1 1 1 1 1 ...
There are 113937 rows and 81 columns in all the dataset.
In this section I will explore the main variables in the dataset.
Why customers do ask for a loan at Prosper? Let’s see the different loan categories in the Listingcategory columns. The class of the column is numeric, but according to the documentation for the different columns the numbers can be mapped to different categories, so I’ll convert the numbers to factors with the names of the categories.
Let’s now see how the customers are distributed in the different categories and plot the data.
##
## Not Available Debt Consolidation Home Improvement
## 16965 58308 7433
## Business Personal Loan Student Use
## 7189 2395 756
## Auto Other Baby&Adoption
## 2572 10494 199
## Boat Cosmetic Procedure Engagement Ring
## 85 91 217
## Green Loans Household Expenses Large Purchases
## 59 1996 876
## Medical/Dental Motorcycle RV
## 1522 304 52
## Taxes Vacation Wedding Loans
## 885 768 771
The plot above shows the different categories of loans in decreasing order of number of requests (note that the bar lenght is proportional to the log of the count). The category with the highest number loan requests is “Debt Consolidation” which is the request of loans for repaying other loans, a sign that maybe the economical situations of Prosper Loans clients is quite complicated. The second and third category are unfortunately “Not Available” and “Other”, which are clearly not very informative, a sign that for understanding better this dataset other investigations are required. The first significative category after “Debt Consolidation” is “Home Improvement”, followed by “Business” and “Auto”. The fact that the first category is “Debt Consolidation” and that typical categories for requesting a loan, like purchasing an home or student loans, aren’t the most important in the Prosper Loans dataset, seems to indicate Prosper Loan could be a company specialized in offering opportunity of loans to users with already other kind of loans.
Considering that the first category for requesting a loan is Debt Consolidation, let’s now see how many credit lines Prosper customers have already opened at the moment the new loan was requested.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 6.00 9.00 9.26 12.00 54.00 7604
## [1] 5.022644
The graph above represents the distrubution of the number open credit lines at the moment the new loan was requested. It is clear that most of the clients have already more than one credit line opened, the median is 9, with a standard deviation of about 5, and there is clearly a mode at 8.
50% of Proser Loans clients have between 6 and 12 credit lines opened and 25% have 12 or more credit lines opened. The distribution is skewed to the right, due to a tail of customers with high number of credit lines opened.
How big is the typical amount of the loan? Let’s look at data.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6500 8337 12000 35000
The graph above shows how the original amounts for the loans are distributed. Smaller loans under 7000$ are more numerous respect to loans with higher amounts. In the region lower amounts there is a peak at 4000$. Locally the number of loans increases for round amounts like 10,000$, 15,000$, 20,000$ and 25,000$. The lowest loans have a value of 1000$ and the bigger ones are of 35,000$. For this distribution the median is a better predictor than the mean, infact the median is 6500$, while the mean is 8337$, due to the number of bigger loans in the tail of the distribution.
How big is the montly loan payment? Looking at the monthly loan payment, the summary shows a minimum payment of zero. I’m going to remove the zero values to have better estimate of how big really is the monthly amount for the customers who are effectively paying their debt.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 131.6 217.7 272.5 371.6 2251.5
The distribution of the monthly loan payments is skewed to the right, with a mode around 200$. The median of the distribution is 217.7$. Most of the loans have a monthly payment below 600$.
The income range of customers is classified in range categories, let’s now see the number of customers in every category.
## [1] "$0" "$1-24,999" "$100,000+" "$25,000-49,999"
## [5] "$50,000-74,999" "$75,000-99,999" "Not displayed" "Not employed"
The graph above shows the Income Range for Customers. Most of them are in the range of income between 50,000$ and 74,999$ or between 74,999$ and 99,999$. Only a tiny number of customers have 0$ of income or are not employed
An important element for understanding the impact of the debt on the customer and to predict if the debt can be ever be repaid is the debt to income ratio: is the debt only a reasonable fraction of the income, or maybe is it a multiple? Let’s see it.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.140 0.220 0.276 0.320 10.010 8554
The graph above shows how the ratio between debt and income of customers is distributed. Most of debts are well below the income of customers, so they are only a fraction of the income: for most of customers the debt is under 22% of their income (the median is at 0.22 and the mean is 0.276), but there is a long tail of customers whose the debt is instead a multiple of the income, some of them have a debt 10 times bigger than their income or more (there is a cap at 10 in the data). Considering the the maximum amount of the loan is 35,000$ this is probably due to low income.
At Prosper a loan can be in different statuses: it can be completed (it has been completely repaid), current (the customer is paying the debt on time), chargedoff (a status for debts unlikely to be collected), defaulted (officially declared not repayble), past due plus the number of days (for customers who are late on payments), cancelled and lastly the loan can be in the category of the final payment. Let’s see the proportion of customers in each category.
## # A tibble: 12 x 2
## LoanStatus customer_proportions
## <fct> <dbl>
## 1 Cancelled 0
## 2 Chargedoff 0.105
## 3 Completed 0.334
## 4 Current 0.497
## 5 Defaulted 0.044
## 6 FinalPaymentInProgress 0.002
## 7 Past Due (>120 days) 0
## 8 Past Due (1-15 days) 0.007
## 9 Past Due (16-30 days) 0.002
## 10 Past Due (31-60 days) 0.003
## 11 Past Due (61-90 days) 0.003
## 12 Past Due (91-120 days) 0.003
The plot above shows the proportion of customers for different loan status. Nearly 50% of Prosper Loan clients in the dataset have a current loan going on and about 33% have completed their payments. The number of clients who are late with the payments is very tiny and near zero, but about 10% of loans are considered chargedoff, meaning there is a very tiny hope the debt can be collected because the client is severely delinquent on a debt (creditors will make this declaration at the point of six months without payment) and 4% are defaulted,meaning there is even less hope for the debt to be repaid.
Considering the previous plot, it seems sometimes customer are late on the payments, but how many months does a Prosper Loan last?
##
## 12 36 60
## 1614 87778 24545
It seems there are only three main durations for a loan: 12, 36 and 60 months.
The plot above shows that most of customers have a loan with a total duration of 36 months, the second category is 60 months and the last is 12 months.
Prosper is peer-to-peer lending company: every loan can be funded by one or more investors, some of them can be labeled as “friends”, let’see how many investors a typical loan has and how many of the loans are funded by friends.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 2.00 44.00 80.48 115.00 1189.00
## [1] 103.239
## [1] 0.2441174
The plot above shows the distribution of the number of investors per loan, on the y axis the lenghts of the bar are on a log scale. There is huge variability in the number of investors involved in a loan, they range between a minimum of 1 to amaximum of 1189, with a notable standard deviation of 100. About 24% of loans are funded by just one investors, but there are loans with many investors. The plot shows a clear tendency: loans with more investors are usually less numerous than loans with less investors.
Let’s now look at the proportion of loans funded by friends.
##
## 0 1 2 3 4 5 6 7 8 9
## 111806 1835 215 40 15 8 4 2 3 5
## 13 15 20 33
## 1 1 1 1
## [1] 0.01870332
About 2% of loans are funded by friends and most of them just by one friend, as shown by the table function on the InvestmentFromFriendsCount column.
Borrower rate can be a crytical factor for succesfully repaying a loan, let’s now see the the distribution of interest rates for the customers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1340 0.1840 0.1928 0.2500 0.4975
The plot above shows the interest rate for loans at Prosper. The median of the distribution is at 0.1840 which is indeed quite high for a loan. Moreover there are local peaks for even higher interest rates over 0.25 and at about 0.33: such high rates are indeed illegal for a loan in some countries,not even considering the maximum rate 0.49 which is really incredibly high.
We saw that a high number of customers ask for a loan at Prosper for repaying some other debt. How many of them are already late for repaying a debt? How many delinquencies (not repaying a debt in time) did they have in the last 7 years?
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.0000 0.0000 0.5921 0.0000 83.0000 697
## [1] 0.2123542
About 21% of all the customers in the dataset are late on one or more payments, let’s now see how many “delinquencies” such customers have and how they are distributed.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 1.000 2.000 2.853 3.000 83.000 697
The plot above shows the distribution of the number of delinquencies for customers with at least one delinquency. Most of those customers have more than one delinquency pending, but only 25% of them have 3 or more.
Let’s now see at the number of delinquencies in the last 7 years. How many Prosper customers had delinquencies in that period?
## [1] 0.3291117
About 33% of borrowers at Prosper had at least one delinquency in the last 7 year, this is in line with the main category for asking for loan which is “Debt Consolitdation”
Generally a home is considered a good investment or a good asset for requesting a loan, but it can also be a source of debt. Let’s now see how many customers are home owners.
##
## False True
## 56459 57478
The plot above shows the number of home owners and not home owners: customers seem failry divided into the two categories.
Let’s now see the employment status.
## [1] "" "Employed" "Full-time" "Not available"
## [5] "Not employed" "Other" "Part-time" "Retired"
## [9] "Self-employed"
There are 9 categories selectable as employment status: Employed, Full-Time, Not Available, Not employed, Part-time, Retired and Self-employed and a category with no definition, left blank for which we have no clue what it could represent and therefore I’ll rename it “Not available”"
## [1] "Not available" "Employed" "Full-time" "Not employed"
## [5] "Other" "Part-time" "Retired" "Self-employed"
The graph above shows the employment status for Prosper customers: most of them are employed, 22% define themselves full-time.
At every loan is associated a score with a value from 1 to 11, with 11 being the best, or lowest risk score. This score is present only for loans since July 2009, for previous dates the value is NA. Let’s see how many customers there are in each category.
Please note: the documentation linked above in this analysis states that for the variable ProsperScore the range is from 1 to 10, but according to this link on the Prosper website is instead form 1 to 11, which is in accord with what found in the data.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 4.00 6.00 5.95 8.00 11.00 29084
The graph above shows the Prosper Score given to customers since 2009, with 1 being the worst and 11 being the best. Only a tiny faction of customers are rated as best or worst, but a significative portion of customers have low ratings, with 4 being the most popular rating. The mean is 5.95 very close to the median of 6.
We saw that investments in lending money at Prosper can be risky, there is always a risk connected with a loan. The risk can be summarized by the so called Estimated Loss , which is the sum of the values of all possible losses multiplied for the probability of the loss occurring. (see here for more details) Luckily there is a variable in the dataset estimating this value (EstimatedLoss), let’s see how it is distributed. Note:this value is applicable only to loans after July 2009.
summary(loans$EstimatedLoss)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.005 0.042 0.072 0.080 0.112 0.366 29084
The graph above shows the estimated loss for loans (a variable present only for loans after July 2009). Most of loans have an estimated loss below 0.10, the median for the distribution is 0.072, but there are a few very risky loans with an estimated loss over 0.3, tthe maximum is at 0.366
The Prosper Loans dataset consists of 113937 rows and 81 columns describing loans given by Prosper a peer-to-peer lending company.
I have investigated in particular the variables described below:
The ListingCategory column lists the categories for the loan requests, the most popular is Debt Consolidation“, followed by”Not availble“,”Other" and “Home Improvement”.
The IncomeRange column is used to classify customers according to their income: most of customer have an income over 50,000$, the most popular income category is the bucket for 50,000-74,999$, but some of the customers have also an income of 0.
The LoanStatus column describes in which state is the loan: most of loans are in the “Current” or “Completed” category, so they are currently being repaid or have being totally repaid, but 10% of loans are Chargedoff (they are unlikely to be repaid) and 4% are defaulted(even less hope to be ever repaid).
The HomeOwnership column describes if a customer owns a home: about 50% of the customers are home owners.
The EmploymentStatus column describes the category of employment of the customers: most of customers are in the category “Employed”“.
The OpenCreditLines column is a numeric variable for the number of current credit lines of the customer at the times the credit was pulled: most of customers have more than one credit line opened, with a median and a mean of about 9.
The LoanOriginalAmount column lists the amount of the loan in origin: it ranges between a minimum of 1000 and a maximum of 35,000$. The most popular loan is about 4,000$. Other popular loans are: 10,000$, 15,000$, 20,000$ and 25,000$ (notably all round numbers).
The MonthlyLoanPayment column lists the montly payment for the loan: the mean is 272.5 and most of payments are under 600$, though the maximum monthly amount is 2251.5$.
The LoanDuration column lists only 3 durations for the loan expressed in months:12, 36 or 60 (so 1, 3 or 5 years).
The DebtToIncomeRatio column lists the ratio debt/income: for most of customers the debt is under 22% of their income (the median is at 0.22 and the mean is 0.276), but there is a long tail of customers whose the debt is instead a multiple of the income, some of them have a debt 10 times bigger than their income or more (there is a cap at about 10 in the data). Considering the the maximum amount of the loan is 35,000$ this is probably due to low income.
The Investors column lists the number of investors for each loan. There is huge variability in the number of investors involved in a loan, they range between a minimum of 1 to amaximum of 1189, with a notable standard deviation of 100. About 24% of loans are funded by just one investors, but there are loans with many investors.
The InvestmentFromFriendsCount column lists the number of friends who funded the loan: only 2% of loans were funded by at least 1 friend.
The BorrowerRate column lists the interest rate for every loan. The median of the distribution is at 0.1840 and the mean is about 0.19, which are quite high, moreover there are local peaks for even higher interest rates at over 0.25 and at about 0.33. The maximum intererst rate is very high:0.4975.
The CurrentDelinquency column list the number of debt for which the customer is late on payment: we saw that about 21% of all the customers are late on at least one payment.
The DelinquenciesLast7Years column list the number of delinquencies in the last 7 years: nearly 33% of custmers had one delinquency in the last 7 years.
The ProsperScore columns is the list of risk scores for borrowers, ranging from 1 (worst) to 11 (best). The score has been used only since July 2009. The mean score for Prosper customers is about 6 which also the median, only a minority of customers have the worst or the best score.
The EstimatedLoss columns lists a variable that give an estimate of the loss associated with the loan: it is the sum of the values of all possible losses multiplied for the probability of the loss occurring.
The first and most interesting feature in this dataset is the ListingCategory of the loans, because it gives a big hint on who are the Prosper Loan Customers: the greatest category is “Debt Consolidation”, so many users are asking for a loan in order to repay other debts.
Looking at the number of other credit lines opened, we saw that most of the customers have more then one and that 21% of them have at least one current delinquency ( that’s to say they are late on the payment) and 33% had at least one delinquency in the last 7 years.
But why do they ask for a loan at Propser? Do Prosper have good interest rates? Looking at column for the interest rate the answer is a big no, interest rate at Prosper are quite high, some of them are over 25%. If we consider the ProsperScore, rating the risks associated with the loans, we see that half of the customers have low scores, under 6 on a scale ranging from 1 to 11, where 11 is the maximum. So it could be that Prosper customers are in general considered risky and they maybe resort to requesting a loan at Prosper, despite the interest rates, because they are in need of money difficult to obtain due to the risk associated with their profile. This interpretation of potentially risky borrowers may be confirmed by the data in the LoanStatus column: 10% of all the loans are Chargedoff (a definition usually applied to debt very diffucult to collect) and 4% are defaulted (no hope or very tiny hopes to see the lended money again), so about 14% of borrowers in total had been a bad deal for the investors. So why do investors lend money to borrowers? Again a look at the interest rate could give us a hint (but clearly not a definitive answer, because we may only make hypothesis): interest are very high, the possibility of profits maybe obscures the risks for some investors. The reason for lending money to borrowers seem to be friendship only for 2% of the cases, as we saw analyzing the InvestmentFromFriendsCount column.
In the next part of the analysis I’ll try to see how the original amount of the loan is influenced by home ownership and employment status and what are the features that are associated with more risky loans.
I reordered the levels in the IncomeRange columns to have more consistency in plotting, ordering them from lowest to highest.
I used a square root transformtion for plotting the debt to income ratio, because most of the ratios are below 0.5, but the scale on the x axis ranges from 0 to 10, so the high number of loans in the very first section of the plot was obscuring the loans with very high ratios and low numbers on the right side of the plot (they were impossible to see), but with the transformation they had become visible again.
I used a log10 transformation on the y axis for the plot of the distribution of the number of investors and of the current delinquencies, because a very high number of loans transformation are on the left part of the plot, obscuring the right side with a lower number of loans, but after the transformation they were visible again.
I incorporeted one of the levels for the EmploymentStatus column in “Not available” because it was simply listed as “” (empty string).
We saw that one of the main category of loans is “Debt Consolidation”, but is it also the more funded? Let’see it.
In the plot above the various listing categories for loans are listed in order of the median for the amount of the loan. So the plot represents from top to down the categories that are more likely to obtain bigger loans. Again “Debt Consolidation” is on top, so it is likely to obtain bigger loans, but as surprise when it comes to the amount of the loans new categories emerge, the second one is Baby&Adoption and the third is Wedding Loans. The undescript loans listed in “Other” or “Not Available” are now much less important in this graph. Students get the lowest loans.
Do Home Owners get bigger loans? A home is a sort a guarantee, so we expect the answer to be yes.
The plot above represents the loan original amount for the categories of home owners and not home owners. It is clear home owners get the bigger loans, so our expectations are confirmed.
Who will get the bigger loans? Employed, Full-time or Self-Employed?
The plot above shows, as expected, that Employed borrowers are more likely to get bigger loans, followed by Self-Employed and Full-time. Not-employed seem to receive bigger loans respect to Part-Time and Retired, and this is a bit surprisingly, but it could also due to a minor propension of such categories to ask for bigger loans.
Do customers with higher incomes ask for bigger loans?
The plot above shows the Income Range for customers ordered according to increasing median for the amount of the original loan. The category of Income Range that obtain bigger loans is the one between 25,000-49,999$, while the richest category for income range, over 100,000$ is only second.
Defaulted and Chargedoff loans were bigger than the other loans? Let’s see it.
The plot above show the loan original amount for loan status. The plot is really interesting because Defaulted, ChargedOff and Completed Loans seem to have approximately the same distribution, so customers who in the end haven’t been able to repay the debt didn’t ask for bigger loans respect to customers who instead have already finished to pay.
We saw that the loan original amount seem to be not very influential in determining the loan status, because completed, defaulted and chargedoff loans were on a very similare range of amounts, but maybe the interest rate could be different for those categories and make a difference for the success of loan, meaning for success not ending in chargedoff,defaulted or in a delinquent status.
The plot above shows the interest rate for loan status, sorted in increasing order of the median. The graph this time is really interesting because all the so called “delinquencies”, charged off and defualted loans seem to have a bigger interest rate respect to the completed, current or with final pyament in progress loans, so interest rate could be really important in determining the success of the loan.
Interest rate at Prosper have a great variabilty: they range from low to very high values. How the original amount is correlated to the Interest Rate? Let’ first calculate the correlation.
##
## Pearson's product-moment correlation
##
## data: LoanOriginalAmount and BorrowerRate
## t = -117.58, df = 113940, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3341283 -0.3237719
## sample estimates:
## cor
## -0.3289599
It seems the loan original amount is only moderately negatively correlated to the Borrower Rate (or interest rate for the borrower). Let’s see it in plot.
The plot above show the Original Amount of the Loan. There are many points, but it is possible to notice that as the interest rate increases, the amount of the loan seem to decrease (there are less points in the upper corner of the plot), the red line on the plot shows the general trend.
Maybe there are other factors that are more influent on the Interest Rate, it would be logical to expect that higher Debt to Income Ratio would be considered more risky, with probably higher interest rates. Let’s see if it’s true, first calculating the correlation between the two variables
##
## Pearson's product-moment correlation
##
## data: DebtToIncomeRatio and BorrowerRate
## t = 20.465, df = 105380, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.05690080 0.06892819
## sample estimates:
## cor
## 0.06291678
There is a very weak positive correlation between the two variables, so it seems that the borrower rate isn’t very important in determining the Interest Rate for the loan.
The plot above shows the nearly absent correlation for Interest Rate and Debt to Income Ratio, the red line is nearly horizontal.
##
## Pearson's product-moment correlation
##
## data: MonthlyLoanPayment and OnTimeProsperPayments
## t = 5.2739, df = 22083, p-value = 1.348e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.02228905 0.04863337
## sample estimates:
## cor
## 0.03546737
The estimated loss is a measure of the possible loss of money associated with the loan, so it is possible it could influence the interest rate. Let’s see the plot of Estimated Loss against Borrower rate and calculate their correlation.
##
## Pearson's product-moment correlation
##
## data: BorrowerRate and EstimatedLoss
## t = 844.11, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9445764 0.9460085
## sample estimates:
## cor
## 0.945297
The correlation here is really high, 0.945 so we expect the loan to have higher interest rates associated at high estimated losses.
The plot above shows the Interest Rate versus the Estimated Loss, as the estimated loss increases it is clear that also the Interest Rate increases. Note: the estimated loss variable is present only for data after July 2009.
##
## Pearson's product-moment correlation
##
## data: LoanOriginalAmount and EstimatedLoss
## t = -138.69, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4353596 -0.4243895
## sample estimates:
## cor
## -0.4298904
There is a moderate negative correlation between the estimated loss and the original amount of the loan.
Let’s plot the estimated loss against the original amount of the loan
The plot above shows the estimated loss against the original amount of the loan. It seems bigger loans over 15,000$ are obtained only when the estimated loss is under 0.1 and for the loans over 25,000$ (25,000 not included) when loss is mostly under 0.05.
In this section I tried to understand what are the variables influencing the loan original amount and the interest rate, two elements that are really important for borrowers and in determinig the success of the loan, meaning for success a loan that is repaid completely. I therefore considered home ownership,employment status and the the income range of the borrowers.
Considering the distribution of the loan, the first three bigger categories are “Debt Consolidation”, “Baby&Adoption” and “Wedding Loans”.
The borrowers seem to be nearly fairly divided in home owners and not home owners, but we saw that home owners get bigger loans.
Considering instead the employment status, “Employed” and “Self-Employed” seem to get loans bigger than other categories, with “Part-Time” wokers getting instead the lowest ones.
Considering the income range of the borrowers the bigger loans are obtained by the customers in the range of 25,000 - 49,999$ and and then by the customers in the highsest range of income (over 100,000$). Customers who didn’t declared their income obtain in general lower loans.
Looking at the loan original amount and confronting it with the status of the loan, I’ve found something interesting: I would expect that loans in a bigger range of amount would be more likely to end up as defaulted or chargedoff, instead I saw that the distribution of amount for defaulted and chargedoff loans was very similar to the one of the completed loans, meaning that in itself the total amount of the loan isn’t so important as I could have expected.
Confronting the Interest Rate with the Loan Status i foudn again something intersting: the interest rate is very critical for determining the success of the loan, infact all the status associated with problems (delinquencies, chargedoff and defaulted) had a distribution of interest rate shifted towards higher values respect to the ones that were on time with payments or completed.
Plotting the interest rate against the loan original amount I saw that bigger loans tend to have lower interest rate, while there is nearly no correlation between the interest rate and the debt to income ratio.
A critical factor in determining the interest rate rate seems to be the estimated loss associated with the loan, at lower interest estimated losses correspond lower interest rates and as the estimated loss increases so increases the interest rate.
Lastly there is moderate negative correlation between the estimated loss and original amount of the loan: the bigger loans over 25,000$ seem to have lower estimated loss mostly below 0.05
The strongest relationship that I found was a strong positive correlation of 0.945 between the interest rate and the estimated loss for the loan.
We saw earlier that the employmente status seems to influence the original amount of the loan in favor of employed and self-employed customers, let’s now see how the combination of borrower rate and original amount of the loan differ for different employment status and how many home owner there are in each category of employment.
The graph above shows the Loan Amount versus the Interest Rate for different working categories. As we saw earlier Employed have the higher loans, but now we see that those higher loans come with lower interest rate and that between those who obtain higher loans at low rates there are mostly home owners. Not employed get lower loans and they are mostly not home owners, but even between not employed home owners get the highest loans of the category. Part-time workers have as well lower loans and they are mostly not home owners. We saw earlier that Self-Employed were the second classified in obtainig bigger loans, and the plot shows that those who obtain bigger loans at low rates are mostly home owners. It seems home ownership is influential in obtaining higher loans at lower rates.
Let’s see how the combination of Prosper Score and Interest Rate differs between Owners and Not Home Owners. The Prosper score is a value between 1 and 11 measuring the reliability of the borrower with 1 being the worst and 11 being the best.
The plot above represents the Proseper Score (ranging from 1 to 11 with 1 being the worst and 11 being the best) versus interest rates. As we saw earlier only a minority of customers get the highest score of 11, for those customers the interest rate is mostly below 0.15 and they seem to be mostly home owners. In general the right side of each category, with highest interest rates, seem to be more red, the color associated with not home owners.
I’m going to plot the Monthly Loan Payment versus Interest Rate for four categories: the Defaulted and Chargedoff (representing bad loans) loans, and Completed and Current (loans being paid) loans.
The plot above shows the montly loan payment versus the interest rate for four categories of loans : ChargedOff, Completed, Current and Defaulted, showing at the same time home owners and not home owners. It is interesting to note that on the right side of every plot, where there are the highest interest rates for every category there are more red dots, so more not home owners, while on left side there are always more blue, so home owners. The plot for the current loans shows that in this category there more loans with higher monthly paymetns at lower interest rates and they are mostly of home owners.
In the previous section of the analysis I saw that the interest rate was critical for the success of the loan, in this section I have investigated how interest rate and home ownership are related to employment status, current status of the loan and to score given by Prosper to customers.
Plotting the loan amount versus the interest rate for different working categories, I saw that home owners were more likeky to obtain higher loans at low interest rates, while between part-time workers, who obtain the lower loans, there were mostly not home owners
Plotting the Prosper Score against Interst Rate and coloring home owners and not home owners differently, showed that home owners are the ones more likeky to obtain the highest score. In general in every category of score, those with higher interest rates were mostly not home owners.
Plotting the monthly loan payment versus interest rate for four categories (Defaulted,Chargedoff, Completed and Current) showed that again in every category home owners got the lowest interest rates while for high interest rates not home owners were mostly present.
The fist plot describes the current status of the loan for Prosper Loans borrowers, showing the proportions of client for each category. About 10% of the customers have a chargedoff loan (a declaration for loans unlikely to be collected) and 4% are defaulted (so is nearly impossible the debt will be every repaid). So 14% of the loans at Prosper turned to be bad investement for the lenders. I have chosen this plot because I think illustrates well this kind of investment can be risky.
The second plot describes the distribution of the interest rates for different loan statuses sorted in ascending order of the median. It is interesting to note that the distributions of interest rate for loans in bad categories (chargedoff, defaulted or late on the payments) are all shifted towards higher values than the ones for the completed, current on with final payment in progress. So interest rates may have an influence on a loan investement turning bad or good.
The third plot shows the Original Amount of the Loan versus the interest rate for employment status of the borrowers. We saw that loans can be risky at Prosper, so this plot shows who are the borrowers who are likely to get the best loans, the ones with high amount and low rates, so the loans given to the borrowers probably less risky. As expected the higher loans go to the Employed Category, who have most of the loans over 20,000$ with relatively low rates. The borrowrers in the employed category are mostly home owners, a sign maybe that the ownership of a home is good asset for obtaining a loan. The “Not employed” and the “Part-Time Category” are mostly composed of not owners and most of their loans are under 10,000$. Self-employed who are also home owners seem to get again higher loans at lower rates, while in the Full-time caegory home ownership seem important for obtaining higher loans, but not lower rates, because home owners who have the highest loans have interest rates across all the range.
The Prosper Loans dataset contains informations over more than 113,000 loans requested to Prosper, a peer-to-peer landing company based in San Francisco, California, between 2005 and 2014. The main category for request of loans at Prosper is “Debt Consolidation”, so many borrowers ask for a loan in order to repay other debts. The amount of the loan requested present int the dataset is between 1,000$ and 35,000$ and the debt can be repaid in 12, 36 or 60 months. At every loan is associated a status: 10% of the loans are considered chargedoff, while 4% are defaulted, so at least 14% of all the debt turned to be bad investments for the lenders. The interest rate for a loan at Prosper seem to be very high, the median being around 0.18 and the mean at around 0.19, but it can potentially be as high as nearly 0.50 (which is illegal for a loan in many countries). In my analysis I saw that the interest rate may bevery important in determining the success of a loan and that at loans with a lower estimated loss is also associated a lower interest rate. About half of the Prosper Loans Customers are also home owners and home owners seem to shine between borrowers: they in general tend to have higher loans at lower rates, this rend is more evident forborrowers who are employed or self-employed.
My investigation across this dataset has indeed many limits: this is certainly only a fraction of all the transactions at Prosper and I have not many information on how the data has been pulled from the Prosper database. I found a range for the loans between 1,000$ and 35,000$, but I don’t know if this is really a cap for the minimum and maximum for loans or only a choice made by who created the original csv file. Some of the variables like the Prosper Score are present only for loans after the year 2009. Moreover we are in 2019 and the last record in the dataset is from 2014, so many factors as interest rates or distributions of incomes, and so on, may be different in the present.
My evaluation about eventual risk associated with the loans are relative only to this data that may be incomplete.